Although the mapper and reducer implementations are all we need to perform the MapReduce job, there is one more piece of code necessary in MapReduce: the driver, which communicates with the Hadoop framework and specifies the configuration elements required to run a MapReduce job. This involves aspects such as telling Hadoop which Mapper and Reducer classes to use, where to find the input data and in what format, and where to place the output data and how to format it. A variety of other configuration options can also be set here.
There is no default parent Driver class to subclass; instead, the driver logic usually exists in the main method of the class written to encapsulate a MapReduce job. Take a look at the following code snippet as an example driver. Don't worry about how each line works, though we should be able to work out generally what each is doing:
public class ExampleDriver
{
    ...
    public static void main(String[] args) throws Exception
    {
        // Create a Configuration object that is used to set other options
        Configuration conf = new Configuration();

        // Create the object representing the job
        Job job = new Job(conf, "ExampleJob");

        // Set the name of the main class in the job jar file
        job.setJarByClass(ExampleDriver.class);

        // Set the mapper class
        job.setMapperClass(ExampleMapper.class);

        // Set the reducer class
        job.setReducerClass(ExampleReducer.class);

        // Set the types for the final output key and value
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Set input and output file paths
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Execute the job and wait for it to complete
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
It is not surprising that much of the setup involves operations on a Job object. This includes setting the job name and specifying which classes are to be used for the mapper and reducer implementations.
Certain input/output configurations are set and, finally, the arguments passed to the main method are used to specify the input and output locations for the job. This is a very common model that we will see often.
There are a number of default values for configuration options, and we are implicitly using some of them in this class. Most notably, we don't say anything about the file format of the input files or how the output files are to be written. These are defined through the InputFormat and OutputFormat classes. The default input and output formats are text files. In addition to plain text, there are multiple ways of expressing data within text files, as well as highly optimized binary formats.
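As a sketch of what those implicit defaults would look like if set explicitly (assuming the org.apache.hadoop.mapreduce API used in the driver above), the following lines could be added to the main method; they simply restate what Hadoop would do anyway:

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Illustrative fragment only: these calls name the formats that are
// already the defaults, so adding them does not change job behavior.
// TextInputFormat reads each line of the input as one record;
// TextOutputFormat writes each key/value pair as a tab-separated line.
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
```

Making the defaults explicit in this way can be useful documentation in a driver, since a reader then sees the full input/output contract of the job at a glance.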
A common model for less complex MapReduce jobs is to define the Mapper and Reducer classes as inner classes within the driver. This allows everything to be kept in a single file, which simplifies code distribution.
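A minimal sketch of that single-file pattern might look like the following; the class name SingleFileJob and the empty nested classes are hypothetical placeholders, not part of the example above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SingleFileJob
{
    // Mapper and Reducer as static nested classes: the whole job
    // lives in one source file, so only one class needs distributing.
    public static class InnerMapper
        extends Mapper<LongWritable, Text, Text, IntWritable>
    {
        // map() implementation would go here
    }

    public static class InnerReducer
        extends Reducer<Text, IntWritable, Text, IntWritable>
    {
        // reduce() implementation would go here
    }

    public static void main(String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "SingleFileJob");
        job.setJarByClass(SingleFileJob.class);

        // The nested classes are referenced directly by the driver
        job.setMapperClass(InnerMapper.class);
        job.setReducerClass(InnerReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note that the nested classes must be declared static; otherwise Hadoop cannot instantiate them on its own when the tasks run on remote nodes.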